Transducer-based language embedding for spoken language identification
Acoustic and linguistic features are both important cues for the spoken
language identification (LID) task. Recent advanced LID systems mainly rely on
acoustic features and lack explicit linguistic feature encoding. In this paper,
we propose a novel transducer-based language embedding approach for LID tasks
by integrating an RNN transducer model into a language embedding framework.
Benefiting from the RNN transducer's linguistic representation capability, the
proposed method can exploit both phonetically-aware acoustic features and
explicit linguistic features for LID tasks. Experiments were carried out on the
large-scale multilingual LibriSpeech and VoxLingua107 datasets. Experimental
results show that the proposed method significantly improves performance on LID
tasks, with 12% to 59% and 16% to 24% relative improvements on in-domain and
cross-domain datasets, respectively.
Comment: This paper was submitted to Interspeech 202
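As an illustration of the idea, the following PyTorch sketch pools the two
streams of a pretrained RNN transducer, the encoder output as
phonetically-aware acoustic features and the prediction-network output as
explicit linguistic features, into a fixed-length language embedding for
classification. The module names, dimensions, and the statistics-pooling and
concatenation choices are assumptions for illustration, not the authors'
implementation.

import torch
import torch.nn as nn

class TransducerLID(nn.Module):
    """Hypothetical LID head on top of a pretrained RNN-T (sketch only)."""
    def __init__(self, rnnt_encoder, rnnt_predictor, feat_dim, num_languages):
        super().__init__()
        self.encoder = rnnt_encoder      # pretrained RNN-T encoder (frozen)
        self.predictor = rnnt_predictor  # pretrained prediction network,
                                         # treated here as a sequence encoder
        # mean+std pooling doubles each stream's dim; the two pooled
        # streams are concatenated (assumes both streams have feat_dim)
        self.classifier = nn.Sequential(
            nn.Linear(4 * feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_languages),
        )

    @staticmethod
    def stats_pool(x):  # (batch, time, dim) -> (batch, 2*dim)
        return torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)

    def forward(self, speech, hyp_tokens):
        acoustic = self.encoder(speech)          # (B, T, D): phonetically aware
        linguistic = self.predictor(hyp_tokens)  # (B, U, D): explicit linguistic
        emb = torch.cat([self.stats_pool(acoustic),
                         self.stats_pool(linguistic)], dim=-1)
        return self.classifier(emb)              # language logits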
Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR
Due to the modality discrepancy between textual and acoustic modeling,
efficiently transferring linguistic knowledge from a pretrained language model
(PLM) to acoustic encoding for automatic speech recognition (ASR) remains a
challenging task. In this study, we propose a cross-modality knowledge transfer
(CMKT) learning framework for a connectionist temporal classification
(CTC)-based ASR system, in which hierarchical acoustic alignments with the
linguistic representation are applied. Additionally, we propose the use of
Sinkhorn attention in the cross-modality alignment process, of which standard
transformer attention is a special case. CMKT learning is designed to compel
the acoustic encoder to encode rich linguistic knowledge for ASR. On the
AISHELL-1 dataset, with CTC greedy decoding for inference (without using any
language model), we achieved state-of-the-art performance with 3.64% and 3.94%
character error rates (CERs) on the development and test sets, corresponding to
relative improvements of 34.18% and 34.88% over the baseline CTC-ASR system,
respectively.
Comment: Submitted to ICASSP 202
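Sinkhorn attention can be sketched as attention whose weight matrix is
produced by alternating row and column normalization of the exponentiated
logits instead of a single softmax; with zero iterations the final row
normalization recovers standard transformer attention. The iteration count,
scaling, and numerical stabilization below are illustrative assumptions, not
the paper's exact recipe.

import torch

def sinkhorn_attention(q, k, v, n_iters=3, eps=1e-8):
    # Scaled dot-product logits, as in standard transformer attention.
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5            # (B, Tq, Tk)
    att = torch.exp(logits - logits.amax(dim=-1, keepdim=True))
    # Sinkhorn normalization: alternate row and column normalization.
    for _ in range(n_iters):
        att = att / (att.sum(dim=-1, keepdim=True) + eps)  # rows
        att = att / (att.sum(dim=-2, keepdim=True) + eps)  # columns
    att = att / (att.sum(dim=-1, keepdim=True) + eps)      # final row norm
    # With n_iters=0 this reduces to ordinary softmax attention.
    return att @ v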
Speech Dereverberation Based on Integrated Deep and Ensemble Learning Algorithm
Reverberation, which is generally caused by sound reflections from walls,
ceilings, and floors, can severely degrade the performance of acoustic
applications. Due to a complicated combination of attenuation and time-delay
effects, the reverberation property is difficult to characterize, and it
remains a challenging task to effectively retrieve anechoic speech signals from
reverberant ones. In the present study, we propose a novel integrated deep and
ensemble learning algorithm (IDEA) for speech dereverberation. The IDEA
consists of offline and online phases. In the offline phase, we train multiple
dereverberation models, each aiming to precisely dereverberate speech signals
in a particular acoustic environment; a unified fusion function is then
estimated to integrate the information of the multiple dereverberation models.
In the online phase, an input utterance is first processed by each of the
dereverberation models, and the outputs of all models are integrated
accordingly to generate the final anechoic signal. We evaluated the IDEA in
designed acoustic environments, including both matched and mismatched
conditions between the training and testing data. Experimental results confirm
that the proposed IDEA outperforms a single deep-neural-network-based
dereverberation model with the same model architecture and training data.
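A compressed sketch of the online phase could look as follows: each
environment-specific model processes the input, and a learned fusion layer
combines the per-model outputs into the final estimate. The waveform-level
interface and the 1x1-convolution fusion are illustrative assumptions; the
paper's fusion function may operate on spectral features instead.

import torch
import torch.nn as nn

class IdeaFusion(nn.Module):
    """Online phase of an IDEA-style ensemble (sketch only)."""
    def __init__(self, models):
        super().__init__()
        self.models = nn.ModuleList(models)  # pretrained per-environment models
        # learned fusion: a 1x1 conv over the model axis (assumed form)
        self.fuse = nn.Conv1d(len(models), 1, kernel_size=1)

    def forward(self, reverberant):                    # (B, T) input signal
        outs = [m(reverberant) for m in self.models]   # each (B, T)
        stacked = torch.stack(outs, dim=1)             # (B, M, T)
        return self.fuse(stacked).squeeze(1)           # (B, T) anechoic estimate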
Cross-modal Alignment with Optimal Transport for CTC-based ASR
Connectionist temporal classification (CTC)-based automatic speech recognition
(ASR) is one of the most successful end-to-end (E2E) ASR frameworks. However,
due to the token independence assumption in decoding, an external language
model (LM) is required, which undermines its fast parallel decoding property.
Several studies have proposed transferring linguistic knowledge from a
pretrained LM (PLM) to CTC-based ASR. Since the PLM is built from text while
the acoustic model is trained on speech, a cross-modal alignment is required in
order to transfer context-dependent linguistic knowledge from the PLM to
acoustic encoding. In this study, we propose a novel cross-modal alignment
algorithm based on optimal transport (OT). In the alignment process, a
transport coupling matrix is obtained using OT and then used to transform a
latent acoustic representation to match the context-dependent linguistic
features encoded by the PLM. Based on this alignment, the latent acoustic
feature is forced to encode context-dependent linguistic information. We
integrate this latent acoustic feature to build a conformer-encoder-based CTC
ASR system. On the AISHELL-1 corpus, our system achieved 3.96% and 4.27%
character error rates (CERs) on the dev and test sets, respectively,
corresponding to relative improvements of 28.39% and 29.42% over the baseline
conformer CTC ASR system without cross-modal knowledge transfer.
Comment: Accepted to IEEE ASRU 202
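A minimal sketch of the alignment step: entropic OT (Sinkhorn iterations with
uniform marginals) yields a coupling between acoustic frames and PLM token
features, which is then used to transport the acoustic representation onto the
token axis so that a distance to the linguistic features can be penalized. The
regularization weight, iteration count, cost function, and loss choices are
illustrative assumptions, not necessarily the paper's.

import torch

def sinkhorn_coupling(cost, reg=0.1, n_iters=50, eps=1e-9):
    """Entropic-OT transport plan with uniform marginals (sketch)."""
    T, U = cost.shape
    K = torch.exp(-cost / reg)      # Gibbs kernel of the cost matrix
    a = torch.full((T,), 1.0 / T)   # uniform marginal over acoustic frames
    b = torch.full((U,), 1.0 / U)   # uniform marginal over PLM tokens
    u, v = torch.ones(T), torch.ones(U)
    for _ in range(n_iters):
        u = a / (K @ v + eps)
        v = b / (K.t() @ u + eps)
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # (T, U) coupling matrix

def ot_alignment_loss(acoustic, linguistic):
    """acoustic: (T, D) latent frames; linguistic: (U, D) PLM features."""
    cost = torch.cdist(acoustic, linguistic)    # pairwise L2 cost
    plan = sinkhorn_coupling(cost).detach()
    # Column-normalize the plan so each token receives a convex combination
    # of acoustic frames, i.e. the transported acoustic representation.
    weights = plan / (plan.sum(dim=0, keepdim=True) + 1e-9)
    aligned = weights.t() @ acoustic            # (U, D)
    return ((aligned - linguistic) ** 2).mean()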